ENH: Automatically preserve links in added pages #3298

larsga · 2025-05-27T13:47:52Z

Here is a draft implementation of the first stage of issue #3290. It handles links in pages added via add_page and insert_page, but it doesn't yet handle links in pages that were merged into those pages before they were added.

Does this look OK?

I'm wondering if some users may already have written their own link patching code -- will this code break theirs? If so, should we make it possible to turn this behaviour off somehow?

At the moment I'm resolving everything by searching lists to find corresponding indirect references. It would be much faster with a hash, but I haven't been able to make that work. Thoughts?
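For illustration, the two lookup strategies mentioned above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code in this PR: `Reference` is a hypothetical stand-in for an indirect reference (it is not pypdf's `IndirectObject`), and `find_by_scan` and `build_index` are made-up helper names. It relies on the general Python rule that dictionary keys need consistent `__eq__` and `__hash__`, which a frozen dataclass provides.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Reference:
    """Hypothetical stand-in for an indirect reference (object number, generation)."""

    idnum: int
    generation: int


def find_by_scan(
    pairs: List[Tuple[Reference, Reference]], source: Reference
) -> Optional[Reference]:
    """Linear search: walks the whole list in the worst case."""
    for old, new in pairs:
        if old == source:
            return new
    return None


def build_index(
    pairs: List[Tuple[Reference, Reference]]
) -> Dict[Reference, Reference]:
    """Hash-based index: built once, then each lookup is a dict access."""
    return dict(pairs)


# Made-up mapping from references in a source document to their copies.
pairs = [(Reference(n, 0), Reference(n + 100, 0)) for n in range(1, 6)]
index = build_index(pairs)

target = Reference(3, 0)
assert find_by_scan(pairs, target) == index.get(target)
```

Both strategies return the same result; the dictionary only moves the cost from every lookup to a one-time build step.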

codecov · 2025-05-27T14:07:06Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.91%. Comparing base (bfe7178) to head (d6d47e2).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3298      +/-   ##
==========================================
+ Coverage   96.89%   96.91%   +0.02%     
==========================================
  Files          54       55       +1     
  Lines        9263     9324      +61     
  Branches     1695     1706      +11     
==========================================
+ Hits         8975     9036      +61     
  Misses        172      172              
  Partials      116      116

stefan6419846 · 2025-05-27T15:42:35Z

Thanks for the PR.

> Does this look OK?

At first sight it looks okay, but unless there are specific aspects to talk about, I tend to prefer a proper review once the automated checks have passed.

It seems like CI is still failing due to coverage, typing and code style issues - this is something to consider in a second step. Nevertheless, regarding the code style, I would prefer to move the new classes into a submodule of pypdf.generic to not bloat the pypdf._writer module where possible.

> I'm wondering if some users may already have written their own link patching code -- will this code break theirs? If so, should we make it possible to turn this behaviour off somehow?

In theory, each release could break the code of some user when doing special stuff. As this fixes shortcomings of the current implementation without removing anything, I do not see the need to introduce a deprecation period for this at the moment.

> At the moment I'm resolving everything by searching lists to find corresponding indirect references. It would be much faster with a hash, but I haven't been able to make that work. Thoughts?

What exactly have you tried and what has been the result?

larsga · 2025-05-27T15:58:41Z

> I tend to prefer a proper review once the automated checks were successful.

Yeah, sorry. I had to cook dinner, and now I have a meeting. I didn't intend to leave the build broken like this. The trouble is I can't get the right version of ruff on my laptop right now, so all the checks end up being done in CI, which is slow. Anyway, I will sort all this out.

> I would prefer to move the new classes into a submodule of pypdf.generic to not bloat the pypdf._writer module where possible.

Will do.

> I do not see the need to introduce a deprecation period for this at the moment.

Ack! 👍

> What exactly have you tried and what has been the result?

Mainly that I couldn't get the reference lookups to work, but I take this to mean that they should work. I'll work on this a bit more.

larsga · 2025-05-28T13:20:16Z

Now it should finally be ready for review.

Note that I changed the type of the Destination.page property. As far as I can tell it's been wrong all along. I certainly don't get an int back when I reference it. I get an IndirectObject, and that's also what the docs say should be passed in as the value. I was forced to do this to get type checking to accept my code -- let me know if you want this change separated out.

Once this PR is merged I'll look at handling links in merged-in pages, but because of an upcoming holiday that will take a while.

stefan6419846 · 2025-05-28T13:26:22Z

Thanks. I will try to have a look at this as soon as possible - this might take some time as well.

Regarding the broken type hints, there is a corresponding issue as well: #3233.

stefan6419846

Thanks for the PR. I just had a first look at the changes and added some small comments. As general notes:

Using abbreviations in the names and docstrings should be avoided. It is completely fine to use "reference" instead of "ref" for example to improve clarity and avoid having to deprecate stuff later on.
In the type hints, please use type1, type2 instead of type1,type2. I have marked some cases, but not all.
Instead of nesting functions and bloating the already large modules, consider moving the corresponding functionality to the new submodule.
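As a rough illustration of these notes, the contrast might look like the sketch below. All names here (`find_ref`, `find_reference`, `references`) are hypothetical and chosen only to show the style; they are not taken from the PR.

```python
from typing import Dict, Optional


# Discouraged: abbreviated names and no space after the comma in the type hint.
def find_ref(refs: Dict[int,int], key: int) -> Optional[int]:
    return refs.get(key)


# Preferred: spelled-out names, "type1, type2" spacing, and a module-level
# function (which could live in a submodule) instead of a nested helper.
def find_reference(references: Dict[int, int], key: int) -> Optional[int]:
    """Return the reference stored under ``key``, or None if it is absent."""
    return references.get(key)
```

Both functions behave identically; only naming and formatting differ.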

stefan6419846 · 2025-07-01T10:08:25Z

Thanks for your patience - larger PRs do not play nicely when there is lots of other stuff to do from my side.

I made some small corrections myself and added a comment. Currently, there is a merge conflict as well. Once this is resolved, I will take a final look at it and try to get it merged.

larsga · 2025-07-21T12:09:59Z

Your review came in just as I left for holiday, so this took a while.

I managed to resolve the conflicts, so I think this is ready now.

stefan6419846

No worries for the delay - larger PRs might take some more time from my side as well.

I just made some basic formatting changes and it looks ready to be merged to me now.

larsga · 2025-07-22T08:07:42Z

Fantastic to have this merged! Although this PR does not actually solve our problem. :) The next step is to also support link rewriting when pages have been merged.

larsga force-pushed the issue-3290 branch 2 times, most recently from d0b2c8a to 460139e · May 27, 2025 13:59

larsga force-pushed the issue-3290 branch 2 times, most recently from fb8c123 to 7e394e4 · May 27, 2025 14:10

larsga force-pushed the issue-3290 branch 15 times, most recently from 7274240 to 09ea9b0 · May 28, 2025 13:05

stefan6419846 requested changes Jun 4, 2025

larsga force-pushed the issue-3290 branch 5 times, most recently from cd41ccd to c4486dd · June 18, 2025 09:09